\section{The Data Sources}

Natural language researchers working in our age should be quite
excited, as Web trends like social networks and crowdsourcing offer
a practically endless source of data. Most of it is in English, so
we picked English as the language for this project. \\

For the word categorization problem, we plan to pick no more than 10
books from Project Gutenberg \cite{guten} (which can be downloaded
in several formats, including plain UTF-8 text). Kohonen, for
example, used just the Grimm tales in \cite{kohonen-semap}, a corpus
with a vocabulary of around 5,000 unique words and nearly 100,000
occurrences. The pre-classification of words will be achieved with
the NLTK POS tagger. \\
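As a minimal sketch of this step (the filename below is a
hypothetical placeholder for a book downloaded from Project Gutenberg
as plain UTF-8 text), the tokenization and POS tagging with NLTK
could look as follows:

\begin{verbatim}
import nltk
from collections import Counter

# One-time downloads: tokenizer models and the default POS tagger.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

# Hypothetical path to a Project Gutenberg book in plain UTF-8 text.
with open("grimm_tales.txt", encoding="utf-8") as f:
    text = f.read()

tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)  # list of (word, tag) pairs

vocab = Counter(t.lower() for t in tokens)
print(len(vocab), "unique words,", sum(vocab.values()), "occurrences")
\end{verbatim}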

For the topic classification of text, we have not yet decided on the
source, but we want medium-sized documents with a pre-tagged topic
(so that we can compare it against the classification discovered by
the SOM). Possible candidates are news or forum archives, or public
blog articles. \\

For visualizing the outcome of the learning algorithm, we simply plan
to represent the trained SOM as a pixel map, assigning a color to
each predefined category (sub-colors can represent nested
clustering). We will then take the training set (or a subset of it)
and pass each vector through the SOM, drawing the pixel at the winner
neuron's position with the assigned color. This should suffice for
our purposes, although we may explore more sophisticated
visualization techniques. This visualization technique implies that
the number of neurons (the map dimension) cannot be too large; thus,
we will assign at most one neuron per training vector. \\
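A minimal sketch of this drawing procedure (the array names and
shapes below are our assumptions, not a fixed interface): given the
grid of trained neuron weights and the labeled training vectors, we
look up each vector's winner neuron and color that cell by its
category.

\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

def winner(weights, v):
    # weights: (rows, cols, dim) grid of neuron weights; v: (dim,)
    dists = np.linalg.norm(weights - v, axis=2)
    return np.unravel_index(np.argmin(dists), dists.shape)

def plot_som(weights, vectors, labels):
    # Color each winner cell with its vector's category (0 = unused).
    image = np.zeros(weights.shape[:2])
    for v, label in zip(vectors, labels):
        r, c = winner(weights, v)
        image[r, c] = label + 1
    plt.imshow(image, cmap="tab10")
    plt.title("Trained SOM: winner neurons colored by category")
    plt.show()
\end{verbatim}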

Finally, one important remark is that the encoding of the data
sources is a computationally demanding endeavor (not numerically
intensive like the learning algorithm, but still data-intensive). We
will explore how to efficiently encode our large data sets, either
with Spark or Hadoop (or a combination of both, as they are
integrated over HDFS). \\
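To illustrate the Spark option (a sketch only: the HDFS path is
hypothetical, and the encoding is reduced here to a distributed word
count, the first step toward building the input vectors), such a job
could look like this:

\begin{verbatim}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("corpus-encoding").getOrCreate()

# Hypothetical HDFS location of the plain-text books.
lines = spark.sparkContext.textFile("hdfs:///corpus/*.txt")

# Distributed word count over the whole corpus.
counts = (lines.flatMap(lambda line: line.lower().split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

vocabulary = dict(counts.collect())
spark.stop()
\end{verbatim}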
